This is a comprehensive exploratory analysis for the New York City Taxi Trip Duration competition with tidy R and ggplot2.
The goal of this playground challenge is to predict the duration of taxi rides in NYC based on features like trip coordinates or pickup date and time. The data comes in the shape of 1.5 million training observations (../input/train.csv) and 630k test observations (../input/test.csv). Each row contains one taxi trip.
In this notebook, we will first study and visualise the original data, engineer new features, and examine potential outliers. Then we add two external data sets on the NYC weather and on the theoretically fastest routes. We visualise and analyse the new features within these data sets and their impact on the target trip_duration values. Finally, we will make a brief excursion into viewing this challenge as a classification problem and finish this notebook with a simple XGBoost model that provides a basic prediction.
I hope that this notebook will help you in getting started with this challenge. There is lots of room for you to develop your own ideas for new features and visualisations. In particular the classification approach and the final model are only a basic starting point for you to improve and optimise them for better performance. As always, any feedback, questions, or constructive criticism are much appreciated.
Originally, this analysis contained a few hidden figures, due to the memory/run-time limitations of the Kernels environment at the time. This means that I had to disable a few auxiliary visualisations as the notebook grew towards its final stage. I tried to only remove those plots that didn’t affect the flow of the analysis too much. All of the corresponding code was still contained in this notebook and you can easily recover these hidden plots. Now, with the new and improved Kernel specs those plots are back in the “official” version. (On related matters, the movie Hidden Figures is a pretty awesome piece of history for maths (and space) geeks like us ;-) )
Finally, I would like to thank every one of you for viewing, upvoting, or commenting on this kernel! I’m still a beginner here on Kaggle and your support means a lot to me :-) . Also, thanks to NockedDown for suggesting the new updated title pun in the comments! ;-)
Enough talk. Let’s get started:
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
library('alluvial') # visualisation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('lubridate') # date and time
library('geosphere') # geospatial locations
library('leaflet') # maps
library('leaflet.extras') # maps
library('maps') # maps
library('xgboost') # modelling
library('caret') # modelling
We use the multiplot function, courtesy of R Cookbooks, to create multi-panel plots.
# Define multiple plot function
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE), then plot 1 will go in the upper left, 2 will go in the upper right, and 3 will go all the way across the bottom.
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots <- length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
We use data.table’s fread function to speed up reading in the data.
train <- as.tibble(fread('C:/Users/kojikm.mizumura/Desktop/Data Science/Kaggle/train.csv'))
test <- as.tibble(fread('C:/Users/kojikm.mizumura/Desktop/Data Science/Kaggle/test.csv'))
sample_submit <- as.tibble(fread('C:/Users/kojikm.mizumura/Desktop/Data Science/Kaggle/sample_submission.csv'))
Let’s have an overview of the data sets using the summary and glimpse tools.
First the training data:
summary(train)
## id vendor_id pickup_datetime dropoff_datetime
## Length:1458644 Min. :1.000 Length:1458644 Length:1458644
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :1.535
## 3rd Qu.:2.000
## Max. :2.000
## passenger_count pickup_longitude pickup_latitude dropoff_longitude
## Min. :0.000 Min. :-121.93 Min. :34.36 Min. :-121.93
## 1st Qu.:1.000 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99
## Median :1.000 Median : -73.98 Median :40.75 Median : -73.98
## Mean :1.665 Mean : -73.97 Mean :40.75 Mean : -73.97
## 3rd Qu.:2.000 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96
## Max. :9.000 Max. : -61.34 Max. :51.88 Max. : -61.34
## dropoff_latitude store_and_fwd_flag trip_duration
## Min. :32.18 Length:1458644 Min. : 1
## 1st Qu.:40.74 Class :character 1st Qu.: 397
## Median :40.75 Mode :character Median : 662
## Mean :40.75 Mean : 959
## 3rd Qu.:40.77 3rd Qu.: 1075
## Max. :43.92 Max. :3526282
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
and then the testing data:
summary(test)
## id vendor_id pickup_datetime passenger_count
## Length:625134 Min. :1.000 Length:625134 Min. :0.000
## Class :character 1st Qu.:1.000 Class :character 1st Qu.:1.000
## Mode :character Median :2.000 Mode :character Median :1.000
## Mean :1.535 Mean :1.662
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :9.000
## pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
## Min. :-121.93 Min. :37.39 Min. :-121.93 Min. :36.60
## 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99 1st Qu.:40.74
## Median : -73.98 Median :40.75 Median : -73.98 Median :40.75
## Mean : -73.97 Mean :40.75 Mean : -73.97 Mean :40.75
## 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96 3rd Qu.:40.77
## Max. : -69.25 Max. :42.81 Max. : -67.50 Max. :48.86
## store_and_fwd_flag
## Length:625134
## Class :character
## Mode :character
##
##
##
glimpse(test)
## Observations: 625,134
## Variables: 9
## $ id <chr> "id3004672", "id3505355", "id1217141", "id2...
## $ vendor_id <int> 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1...
## $ pickup_datetime <chr> "2016-06-30 23:59:58", "2016-06-30 23:59:53...
## $ passenger_count <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 4, 1, 1, 1, 1...
## $ pickup_longitude <dbl> -73.98813, -73.96420, -73.99744, -73.95607,...
## $ pickup_latitude <dbl> 40.73203, 40.67999, 40.73758, 40.77190, 40....
## $ dropoff_longitude <dbl> -73.99017, -73.95981, -73.98616, -73.98643,...
## $ dropoff_latitude <dbl> 40.75668, 40.65540, 40.72952, 40.73047, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
From the dataset summary, we find the following points:
- vendor_id only takes the value 1 or 2, presumably to differentiate two taxi companies.
- pickup_datetime and (in the training set) dropoff_datetime are combinations of date and time that we will have to reformat into a more useful shape.
- passenger_count takes a median of 1 and a maximum of 9 in both data sets.
- The pickup/dropoff_longitude/latitude values describe the geographical coordinates where the meter was activated/deactivated.
- store_and_fwd_flag indicates whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”). Maybe there could be a correlation with certain geographical areas with bad reception?
- trip_duration: our target feature in the training data, measured in seconds.
Knowing about missing values is important because they indicate how much we don’t know about our data. Making inferences based on just a few cases is often unwise. In addition, many modelling procedures break down when missing values are involved, and the corresponding rows will either have to be removed completely or the values need to be estimated somehow.
Here, we are in the fortunate position that our data is complete, and there are no missing values:
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
In preparation for our eventual modelling analysis, we combine the train and test datasets into a single one. I find it generally best not to examine the test data too closely, since this bears the risk of overfitting your analysis to this data. However, a few simple consistency checks between the two datasets can be of advantage.
combine <- bind_rows(train %>% mutate(dset="train"),
test %>% mutate(dset="test",
dropoff_datetime=NA,
trip_duration=NA))
combine <- combine %>%
mutate(dset=factor(dset))
head(combine)
## # A tibble: 6 x 12
## id vendor_id pickup_datetime dropoff_datetime passenger_count
## <chr> <int> <chr> <chr> <int>
## 1 id28~ 2 2016-03-14 17:~ 2016-03-14 17:3~ 1
## 2 id23~ 1 2016-06-12 00:~ 2016-06-12 00:5~ 1
## 3 id38~ 2 2016-01-19 11:~ 2016-01-19 12:1~ 1
## 4 id35~ 2 2016-04-06 19:~ 2016-04-06 19:3~ 1
## 5 id21~ 2 2016-03-26 13:~ 2016-03-26 13:3~ 1
## 6 id08~ 2 2016-01-30 22:~ 2016-01-30 22:0~ 6
## # ... with 7 more variables: pickup_longitude <dbl>,
## # pickup_latitude <dbl>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>,
## # dset <fct>
For our following analysis, we will turn the dates and times from characters into date objects. We also recode vendor_id and passenger_count as factors. This makes it easier to visualise relationships that involve these features.
head(train)
## # A tibble: 6 x 11
## id vendor_id pickup_datetime dropoff_datetime passenger_count
## <chr> <int> <chr> <chr> <int>
## 1 id28~ 2 2016-03-14 17:~ 2016-03-14 17:3~ 1
## 2 id23~ 1 2016-06-12 00:~ 2016-06-12 00:5~ 1
## 3 id38~ 2 2016-01-19 11:~ 2016-01-19 12:1~ 1
## 4 id35~ 2 2016-04-06 19:~ 2016-04-06 19:3~ 1
## 5 id21~ 2 2016-03-26 13:~ 2016-03-26 13:3~ 1
## 6 id08~ 2 2016-01-30 22:~ 2016-01-30 22:0~ 6
## # ... with 6 more variables: pickup_longitude <dbl>,
## # pickup_latitude <dbl>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>
train <- train %>%
mutate(pickup_datetime = ymd_hms(pickup_datetime),
dropoff_datetime = ymd_hms(dropoff_datetime),
vendor_id = factor(vendor_id),
passenger_count = factor(passenger_count))
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id <fct> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, ...
## $ dropoff_datetime <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, ...
## $ passenger_count <fct> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
It is worth checking whether the trip_durations are consistent with the intervals between the pickup_datetime and dropoff_datetime. Presumably, the former were directly computed from the latter, but you never know.
Below, the check variable shows “TRUE” if the two intervals are not consistent:
train %>%
mutate(check = abs(int_length(interval(dropoff_datetime,pickup_datetime)) + trip_duration) > 0) %>%
select(check, pickup_datetime, dropoff_datetime, trip_duration) %>%
group_by(check) %>%
count()
## # A tibble: 1 x 2
## # Groups: check [1]
## check n
## <lgl> <int>
## 1 FALSE 1458644
And we find that everything fits perfectly.
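The same kind of consistency check can be sketched with base R’s difftime on a single example row (the dates and the 455 s duration are taken from the glimpse output above):

```r
# First row of the training data: pickup and dropoff timestamps
pickup  <- as.POSIXct("2016-03-14 17:24:55", tz = "UTC")
dropoff <- as.POSIXct("2016-03-14 17:32:30", tz = "UTC")

# The difference in seconds should equal the recorded trip_duration (455 s)
as.numeric(difftime(dropoff, pickup, units = "secs"))  # 455
```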
Visualisations of feature distributions and their relations are key to understanding a dataset, and they often open up new lines of inquiry. I always recommend examining the data from as many different perspectives as possible to notice even subtle trends and correlations.
In this section, we will begin by having a look at the distributions of the individual data features.
We start with a map of NYC and overlay a manageable number of pickup coordinates to get a general overview of the locations and distances in question. For this visualisation we use the leaflet package, which includes a variety of cool tools for interactive maps. In this map you can zoom and pan through the pickup locations:
set.seed(1234)
foo <- sample_n(train,8e3)
head(foo)
## # A tibble: 6 x 11
## id vendor_id pickup_datetime dropoff_datetime passenger_count
## <chr> <fct> <dttm> <dttm> <fct>
## 1 id03~ 1 2016-03-19 22:43:37 2016-03-19 23:07:58 2
## 2 id29~ 2 2016-04-20 01:43:57 2016-04-20 01:44:57 1
## 3 id22~ 1 2016-02-16 23:08:02 2016-02-16 23:21:26 1
## 4 id09~ 2 2016-06-30 20:11:02 2016-06-30 21:01:26 2
## 5 id35~ 1 2016-04-11 20:33:33 2016-04-11 20:53:44 3
## 6 id22~ 1 2016-03-19 21:44:23 2016-03-19 21:49:46 1
## # ... with 6 more variables: pickup_longitude <dbl>,
## # pickup_latitude <dbl>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>
leaflet(data=foo) %>%
addTiles() %>%
addCircleMarkers(~pickup_longitude,~pickup_latitude,radius=1,
color="blue",fillOpacity = 0.3)
Fig. 1
It turns out that almost all of our trips were in fact taking place in Manhattan. Another notable hot spot is JFK Airport towards the south-east of the city.
The map gives us an idea what some of our distributions could look like. Let’s start with plotting the target feature trip_duration:
train %>%
ggplot(aes(trip_duration))+
geom_histogram(fill="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Fig. 2
train %>%
ggplot(aes(trip_duration))+
geom_histogram(fill="red",bins=150)+
scale_x_log10()+
scale_y_sqrt()
Fig. 2
Note the logarithmic x-axis and square-root y-axis.
From the above visualisations, we find the following:
- The majority of rides follow a rather smooth distribution that looks almost log-normal, with a peak just short of 1000 seconds, i.e. about 17 minutes.
- There are several suspiciously short rides of less than 10 seconds duration.
- Additionally, there is a strange delta-shaped peak of trip_duration just before the 1e5-second mark, and even a few values way above it:
train %>%
arrange(desc(trip_duration)) %>%
select(trip_duration,pickup_datetime,dropoff_datetime,everything()) %>%
head(10)
## # A tibble: 10 x 11
## trip_duration pickup_datetime dropoff_datetime id vendor_id
## <int> <dttm> <dttm> <chr> <fct>
## 1 3526282 2016-02-13 22:46:52 2016-03-25 18:18:14 id00~ 1
## 2 2227612 2016-01-05 06:14:15 2016-01-31 01:01:07 id13~ 1
## 3 2049578 2016-02-13 22:38:00 2016-03-08 15:57:38 id03~ 1
## 4 1939736 2016-01-05 00:19:42 2016-01-27 11:08:38 id18~ 1
## 5 86392 2016-02-15 23:18:06 2016-02-16 23:17:58 id19~ 2
## 6 86391 2016-05-31 13:00:39 2016-06-01 13:00:30 id05~ 2
## 7 86390 2016-05-06 00:00:10 2016-05-07 00:00:00 id09~ 2
## 8 86387 2016-06-30 16:37:52 2016-07-01 16:37:39 id28~ 2
## 9 86385 2016-06-23 16:01:45 2016-06-24 16:01:30 id13~ 2
## 10 86379 2016-05-17 22:22:56 2016-05-18 22:22:35 id25~ 2
## # ... with 6 more variables: passenger_count <fct>,
## # pickup_longitude <dbl>, pickup_latitude <dbl>,
## # dropoff_longitude <dbl>, dropoff_latitude <dbl>,
## # store_and_fwd_flag <chr>
Those records would correspond to 24-hour trips and beyond, with a maximum of more than 40 days. I know that rush hour can be bad, but those values are a little unbelievable.
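As a quick sanity check on these magnitudes, converting the extreme durations from seconds to days takes one line of base R (the values below are copied from the table above):

```r
# One day has 60 * 60 * 24 = 86400 seconds.
secs_per_day <- 60 * 60 * 24

# Longest trip_duration values from the table above, in seconds
extreme <- c(3526282, 2227612, 86392)

round(extreme / secs_per_day, 1)  # about 40.8, 25.8, and 1.0 days
```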
Over the year, the distributions of pickup_datetime and dropoff_datetime look like this:
p1 <- train %>%
ggplot(aes(pickup_datetime))+
geom_histogram(fill="red",bins=120)+
labs(x="Pickup dates")
p2 <- train %>%
ggplot(aes(dropoff_datetime))+
geom_histogram(fill="blue",bins=120)+
labs(x="Dropoff dates")
layout <- matrix(c(1,2),2,1,byrow = F)
multiplot(p1,p2,layout=layout)
Fig. 3
p1 <- 1;p2 <- 1
Fairly homogeneous, covering half a year between January and July 2016. There is an interesting drop around late January / early February:
train %>%
filter(pickup_datetime>ymd("2016-01-20") & pickup_datetime<ymd("2016-02-10")) %>%
head(10)
## # A tibble: 10 x 11
## id vendor_id pickup_datetime dropoff_datetime passenger_count
## <chr> <fct> <dttm> <dttm> <fct>
## 1 id08~ 2 2016-01-30 22:01:40 2016-01-30 22:09:03 6
## 2 id09~ 2 2016-02-03 16:22:50 2016-02-03 16:37:20 5
## 3 id20~ 2 2016-01-28 11:25:20 2016-01-28 11:30:56 1
## 4 id30~ 2 2016-01-20 19:21:31 2016-01-20 19:31:27 1
## 5 id22~ 1 2016-02-04 13:22:02 2016-02-04 13:40:30 1
## 6 id32~ 2 2016-01-27 08:07:32 2016-01-27 08:24:01 1
## 7 id06~ 1 2016-01-26 23:40:14 2016-01-26 23:48:39 1
## 8 id37~ 2 2016-01-30 15:15:47 2016-01-30 15:30:47 2
## 9 id11~ 2 2016-01-21 14:34:06 2016-01-21 14:44:10 1
## 10 id17~ 1 2016-02-09 15:54:53 2016-02-09 16:04:47 1
## # ... with 6 more variables: pickup_longitude <dbl>,
## # pickup_latitude <dbl>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>
train %>%
filter(pickup_datetime>ymd("2016-01-20") & pickup_datetime<ymd("2016-02-10")) %>%
ggplot(aes(pickup_datetime))+
geom_histogram(fill="red",bins=120)
Fig. 3b
That’s winter in NYC, so maybe snow storms or other heavy weather? Events like this should be taken into account, maybe through some handy external data set?
In the plot above, we can already see some daily and weekly modulations in the number of trips. Let’s investigate these variations together with the distributions of passenger_count and vendor_id by creating a multi-plot panel with different components:
head(train,10)
## # A tibble: 10 x 11
## id vendor_id pickup_datetime dropoff_datetime passenger_count
## <chr> <fct> <dttm> <dttm> <fct>
## 1 id28~ 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1
## 2 id23~ 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1
## 3 id38~ 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1
## 4 id35~ 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1
## 5 id21~ 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1
## 6 id08~ 2 2016-01-30 22:01:40 2016-01-30 22:09:03 6
## 7 id18~ 1 2016-06-17 22:34:59 2016-06-17 22:40:40 4
## 8 id13~ 2 2016-05-21 07:54:58 2016-05-21 08:20:49 1
## 9 id13~ 1 2016-05-27 23:12:23 2016-05-27 23:16:38 1
## 10 id00~ 2 2016-03-10 21:45:01 2016-03-10 22:05:26 1
## # ... with 6 more variables: pickup_longitude <dbl>,
## # pickup_latitude <dbl>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>
p1 <- train %>%
group_by(passenger_count) %>%
count() %>%
ggplot(aes(passenger_count,n,fill=passenger_count))+
geom_col()+
scale_y_sqrt()+
theme(legend.position = "none")
p2 <- train %>%
ggplot(aes(vendor_id,fill=vendor_id))+
geom_bar()+
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(store_and_fwd_flag)) +
geom_bar() +
theme(legend.position = "none") +
scale_y_log10()
p4 <- train %>%
mutate(wday=wday(pickup_datetime,label=TRUE)) %>%
group_by(wday,vendor_id) %>%
count() %>%
ggplot(aes(wday,n,colour=vendor_id))+
geom_point(size=4)+
labs(x="Day of the week",y="Total number of pickups")+
theme(legend.position = "none")
p5 <- train %>%
mutate(hpick = hour(pickup_datetime)) %>%
group_by(hpick, vendor_id) %>%
count() %>%
ggplot(aes(hpick, n, color = vendor_id)) +
geom_point(size = 4) +
labs(x = "Hour of the day", y = "Total number of pickups") +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
Fig. 4
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1
We find:
- There are a few trips with zero, or seven to nine, passengers, but they are rare exceptions.
train %>%
group_by(passenger_count) %>%
count() %>%
arrange(desc(n))
## # A tibble: 10 x 2
## # Groups: passenger_count [10]
## passenger_count n
## <fct> <int>
## 1 1 1033540
## 2 2 210318
## 3 5 78088
## 4 3 59896
## 5 6 48333
## 6 4 28404
## 7 0 60
## 8 7 3
## 9 8 1
## 10 9 1
train %>%
group_by(store_and_fwd_flag) %>%
count()
## # A tibble: 2 x 2
## # Groups: store_and_fwd_flag [2]
## store_and_fwd_flag n
## <chr> <int>
## 1 N 1450599
## 2 Y 8045
These numbers are equivalent to about half a percent of trips not being transmitted immediately.
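The half-a-percent figure follows directly from the counts above:

```r
# Counts from the store_and_fwd_flag table above
n_stored <- 8045
n_total  <- 8045 + 1450599

round(100 * n_stored / n_total, 2)  # about 0.55 percent of all trips
```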
The trip volume per hour of the day depends somewhat on the month and strongly on the day of the week:
p1 <- train %>%
mutate(hpick=hour(pickup_datetime),
Month=factor(month(pickup_datetime,label=TRUE))) %>%
group_by(hpick,Month) %>%
count() %>%
ggplot(aes(hpick,n,colour=Month))+
geom_line(size=1.5)+
labs(x="Hour of the day",y="count")
p2 <- train %>%
mutate(
hpick = hour(pickup_datetime),
wday = factor(wday(pickup_datetime, label = TRUE))) %>%
group_by(hpick,wday) %>%
count() %>%
ggplot(aes(hpick,n,colour=wday))+
geom_line(size=1.5)+
labs(x="Hour of the day",y="count")
layout <- matrix(c(1,2),2,1,byrow = F)
multiplot(p1,p2,layout = layout)
Fig. 5
p1 <- 1;p2 <- 1
We find:
- January and June have fewer trips, whereas March and April are busier months. This tendency is observed for both vendor_ids.
Finally, we will look at simple overview visualization of the pickup/dropoff latitudes and longitudes:
p1 <- train %>%
filter(pickup_longitude > -74.05 & pickup_longitude < -73.7) %>%
ggplot(aes(pickup_longitude)) +
geom_histogram(fill = "red", bins = 40)
p2 <- train %>%
filter(dropoff_longitude > -74.05 & dropoff_longitude < -73.7) %>%
ggplot(aes(dropoff_longitude)) +
geom_histogram(fill = "blue", bins = 40)
p3 <- train %>%
filter(pickup_latitude > 40.6 & pickup_latitude < 40.9) %>%
ggplot(aes(pickup_latitude)) +
geom_histogram(fill = "red", bins = 40)
p4 <- train %>%
filter(dropoff_latitude > 40.6 & dropoff_latitude < 40.9) %>%
ggplot(aes(dropoff_latitude)) +
geom_histogram(fill = "blue", bins = 40)
layout <- matrix(c(1,2,3,4),2,2,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 6
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1
Here we had to constrain the range of latitude and longitude values, because there are a few cases that are way outside the NYC boundaries. The resulting distributions are consistent with the focus on Manhattan that we had already seen on the map. These are the most extreme values of the pickup_latitude feature:
train %>%
arrange(pickup_latitude) %>%
select(pickup_latitude,pickup_longitude) %>%
head(5)
## # A tibble: 5 x 2
## pickup_latitude pickup_longitude
## <dbl> <dbl>
## 1 34.4 -65.8
## 2 34.7 -75.4
## 3 35.1 -71.8
## 4 35.3 -72.1
## 5 36.0 -77.4
train %>%
arrange(desc(pickup_latitude)) %>%
select(pickup_latitude,pickup_longitude) %>%
head(5)
## # A tibble: 5 x 2
## pickup_latitude pickup_longitude
## <dbl> <dbl>
## 1 51.9 -72.8
## 2 44.4 -67.0
## 3 43.9 -71.9
## 4 43.5 -74.2
## 5 43.1 -72.6
We need to keep the existence of these (rather astonishing) values in mind so that they don’t bias our analysis.
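One simple way to guard against such coordinates later on is a bounding-box filter around NYC. Here is a minimal dplyr sketch on toy data; the box limits are rough assumptions of mine, not official city boundaries:

```r
library(dplyr)

# Rough NYC bounding box (assumed values; tune as needed)
nyc_lon <- c(-74.3, -73.7)
nyc_lat <- c(40.5, 41.0)

# Toy data: two plausible Manhattan pickups and one far-away outlier
toy <- tibble(pickup_longitude = c(-73.98, -73.97, -65.8),
              pickup_latitude  = c(40.77, 40.75, 34.4))

toy_clean <- toy %>%
  filter(between(pickup_longitude, nyc_lon[1], nyc_lon[2]),
         between(pickup_latitude,  nyc_lat[1], nyc_lat[2]))

nrow(toy_clean)  # the outlier row is dropped, leaving 2
```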
While the previous section looked primarily at the distributions of the individual features, here we will examine in more detail how those features are related to each other and to our target trip_duration.
How does the variation in trip numbers throughout the day and the week affect the average trip duration? Do quieter days and hours lead to faster trips? Here we include the vendor_id as an additional feature. Furthermore, for the hours of the day we add a smoothing layer to indicate the extent of the variation and its uncertainties:
p1 <- train %>%
mutate(wday=wday(pickup_datetime,label=T)) %>%
group_by(wday,vendor_id) %>%
summarise(median_duration=median(trip_duration)/60) %>%
ggplot(aes(wday,median_duration,color=vendor_id))+
geom_point(size=4)+
labs(x="Day of the week",y="Median trip duration [min]")
p2 <- train %>%
mutate(hpick=hour(pickup_datetime)) %>%
group_by(hpick,vendor_id) %>%
summarise(median_duration=median(trip_duration)/60) %>%
ggplot(aes(hpick,median_duration,color=vendor_id))+
geom_smooth(method="loess",span=1/2)+
geom_point(size=4)+
labs(x="Hour of the day",y="Median trip duration [min]")+
theme(legend.position="none")
layout <- matrix(c(1,2),2,1,byrow=FALSE)
multiplot(p1,p2,layout=layout)
Fig. 7
p1 <- 1; p2 <- 1
We find:
- There is indeed a similar pattern as for the busyness of the days of the week. Vendor 2, the one with the more frequent trips, also has consistently higher trip durations than vendor 1. It will be worth adding the vendor_id feature to a model to test its predictive importance.
- Over the course of a typical day, we find a peak in the early afternoon and dips around 5-6am and 8am. The weekday and hour of a trip appear to be important features for predicting its duration and should be included in a successful model.
The next question we are asking is whether different numbers of passengers and/or the different vendors are correlated with the duration of the trip. We choose to examine this issue using a series of boxplots for the passenger_counts together with a facet wrap which contrasts the two vendor_ids:
train %>%
ggplot(aes(passenger_count, trip_duration, color=passenger_count))+
geom_boxplot()+
scale_y_log10()+
theme(legend.position="none")+
facet_wrap(~vendor_id)+
labs(y="Trip duration[s]",x="Number of passengers")
Fig. 8
train %>%
ggplot(aes(trip_duration))+
geom_freqpoly()+
scale_x_log10()+
facet_wrap(~passenger_count)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Fig. 8
We find:
- Both vendors have short trips without any passengers.
- Vendor 1 has all of the trips beyond 24 hours, whereas vendor 2 has all of the (five) trips with more than six passengers, and many more trips that approach the 24-hour limit.
- Between 1 and 6 passengers the median trip durations are remarkably similar, in particular for vendor 2. There might be differences for vendor 1, but they are small (note the logarithmic y-axis):
train %>%
ggplot(aes(trip_duration,fill=vendor_id))+
geom_density(position="stack")+
scale_x_log10()
Fig. 8b
Comparing the densities of the trip_duration distribution for the two vendors we find that the medians are very similar, whereas the means are likely skewed by vendor 2 containing most of the long-duration outliers:
train %>%
group_by(vendor_id) %>%
summarise(mean_duration=mean(trip_duration),
median_duration=median(trip_duration))
## # A tibble: 2 x 3
## vendor_id mean_duration median_duration
## <fct> <dbl> <dbl>
## 1 1 845. 658
## 2 2 1059. 666
The temporary storing of the trip data only occurred for vendor 1:
train %>%
group_by(vendor_id,store_and_fwd_flag) %>%
count()
## # A tibble: 3 x 3
## # Groups: vendor_id, store_and_fwd_flag [3]
## vendor_id store_and_fwd_flag n
## <fct> <chr> <int>
## 1 1 N 670297
## 2 1 Y 8045
## 3 2 N 780302
train %>%
filter(vendor_id==1) %>%
ggplot(aes(passenger_count,trip_duration,colour=passenger_count))+
geom_boxplot()+
scale_y_log10()+
facet_wrap(~store_and_fwd_flag)+
theme(legend.position = "none")+
labs(y="Trip duration[s]",x="Number of passengers")+
ggtitle("Store_and_fwd_flag impact")
Fig. 9
We find that there are no overwhelming differences between the stored and non-stored trips. The stored ones might be slightly longer, though, and don’t include any of the suspiciously long trips.
In this section, we build new features from the existing ones, trying to find better predictions for our target variable. I prefer to define all these new features in a single code block below and then study them in the following subsections. This does not correspond to a linear analysis but it makes the notebook more readable and ensures that we don’t miss any stray feature definitions.
The new temporal features (date, month, wday, hour) are derived from the pickup_datetime. We got the JFK and La Guardia airport coordinates from Wikipedia. The blizzard feature is based on the external weather data.
jfk_coord <- tibble(lon=-73.778889, lat = 40.639722)
la_guardia_coord <- tibble(lon=-73.872611, lat = 40.77725)
pick_coord <- train %>%
select(pickup_longitude,pickup_latitude)
drop_coord <- train %>%
select(dropoff_longitude,dropoff_latitude)
train$dist <- distCosine(pick_coord,drop_coord)
train$bearing <- bearing(pick_coord,drop_coord)
train$jfk_dist_pick <- distCosine(pick_coord, jfk_coord)
train$jfk_dist_drop <- distCosine(drop_coord, jfk_coord)
train$lg_dist_pick <- distCosine(pick_coord, la_guardia_coord)
train$lg_dist_drop <- distCosine(drop_coord, la_guardia_coord)
train <- train %>%
  mutate(speed = dist/trip_duration*3.6,
         date = date(pickup_datetime),
         month = month(pickup_datetime, label = TRUE),
         wday = wday(pickup_datetime, label = TRUE),
         wday = fct_relevel(wday, c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")),
         hour = hour(pickup_datetime),
         work = (hour %in% seq(8, 18)) & (wday %in% c("Mon", "Tues", "Wed", "Thurs", "Fri")),
         jfk_trip = (jfk_dist_pick < 2e3) | (jfk_dist_drop < 2e3),
         lg_trip = (lg_dist_pick < 2e3) | (lg_dist_drop < 2e3),
         blizzard = !( (date < ymd("2016-01-22")) | (date > ymd("2016-01-29")) )
         )
From the coordinates of the pickup and dropoff points, we can calculate the direct distance (as the crow flies) between the two points, and compare it to our trip_duration. Since taxis aren’t crows (in most practical scenarios), these values correspond to the minimum possible travel distance.
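To make the geometry concrete, the spherical law of cosines behind geosphere’s distCosine (which we used above to build the dist features) can be written out in a few lines of base R, using the same default Earth radius of 6378137 m that geosphere assumes:

```r
# Great-circle distance via the spherical law of cosines,
# mirroring geosphere::distCosine (coordinates in degrees, result in metres)
dist_cosine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  d <- sin(lat1 * to_rad) * sin(lat2 * to_rad) +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * cos((lon2 - lon1) * to_rad)
  r * acos(pmin(1, d))  # clamp against tiny floating-point overshoot
}

# JFK to La Guardia, using the airport coordinates defined above
dist_cosine(-73.778889, 40.639722, -73.872611, 40.77725) / 1000  # roughly 17 km
```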
To compute these distances we are using the distCosine function of the geosphere package for spherical trigonometry. This method gives us the shortest distance between two points on a spherical earth. For the purpose of this localised analysis we choose to ignore ellipsoidal distortion of the earth’s shape. Here are the raw values of distance vs duration (based on a down-sized sample to speed up the kernel).
set.seed(4321)
train %>%
sample_n(5e4) %>%
ggplot(aes(dist,trip_duration))+
geom_point(alpha=0.5)+
scale_x_log10()+
scale_y_log10()+
labs(x="Direct distance [m]",y="Trip duration[s]")+
geom_smooth(method="gam")
Fig. 10
We find:
- The distance generally increases with increasing trip_duration.
- Here, the 24-hour trips look even more suspicious and are even more likely to be artefacts in the data.
- In addition, there are a number of trips with very short distances, down to 1 metre, but with a large range of apparent trip_durations.
Let’s filter the data a little bit to remove the extreme (and the extremely suspicious) data points, and bin the data into a 2-d histogram. This plot shows that in log-log space the trip_duration increases more slowly than linearly for larger distance values:
train %>%
filter(trip_duration<3600 & trip_duration>120) %>%
filter(dist>200,dist<100e3) %>%
ggplot(aes(dist,trip_duration))+geom_bin2d(bins=c(500,500))
Fig. 10b
train %>%
filter(trip_duration<3600 & trip_duration>120) %>%
filter(dist>200,dist<100e3) %>%
ggplot(aes(dist,trip_duration))+geom_bin2d(bins=c(500,500))+
scale_x_log10()+
scale_y_log10()+
labs(x="Direct distance [m]",y="Trip duration [s]")
Fig. 10b
Distance over time is of course velocity, and by computing the average apparent velocity of our taxis we will have another diagnostic to remove bogus values.
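The speed column is (presumably) just this ratio with a unit conversion; a minimal sketch, assuming distance in metres and duration in seconds:

```r
# Presumed definition of the speed feature: direct distance (m) over
# trip_duration (s), times 3.6 to convert m/s into km/h.
avg_speed_kmh <- function(dist_m, duration_s) dist_m / duration_s * 3.6

avg_speed_kmh(5000, 1200)  # 5 km in 20 minutes -> 15 km/h
```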
Of course, we won’t be able to use speed as a predictor for our model, since it requires knowing the travel time, but it can still be helpful in cleaning up our training data and finding other features with predictive power. This is the speed distribution:
train %>%
filter(speed>2, speed<1e2) %>%
ggplot(aes(speed))+
geom_histogram(fill="red",bins=50,alpha=0.5)+
labs(x="Average speed [km/h] (direct distance)")
Fig. 11
Well, after removing the most extreme values this looks way better than I would have expected. An average speed of around 15 km/h probably sounds reasonable for NYC. Everything above 50 km/h certainly requires magical cars (or highway travel). Also keep in mind that this refers to the direct distance, so the real velocity would always have been higher.
In a similar way as the average duration per day and hour, we can also investigate the average speed for these time bins:
p1 <- train %>%
group_by(wday,vendor_id) %>%
summarise(median_speed=median(speed)) %>%
ggplot(aes(wday,median_speed,color=vendor_id))+
geom_point(size=4)+
labs(x="Day of the week",y="Median speed [km/h]")
p2 <- train %>%
group_by(hour,vendor_id) %>%
summarise(median_speed=median(speed)) %>%
ggplot(aes(hour,median_speed,color=vendor_id))+
geom_smooth(method="loess",span=1/2)+
geom_point(size=4)+
labs(x = "Hour of the day", y = "Median speed [km/h]")+
theme(legend.position = "none")
p3 <- train %>%
group_by(wday,hour) %>%
summarise(median_speed=median(speed)) %>%
ggplot(aes(hour,wday,fill=median_speed))+
geom_tile()+
labs(x="hour of the day",y="Day of the week")+
scale_fill_distiller(palette = "Spectral")
layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)
Fig. 12
p1 <- 1; p2 <- 1; p3 <- 1
We find:

- Our taxis appear to be travelling faster on the weekend and on a Monday than during the rest of the week.
- The early morning hours allow for a speedier trip, with everything from 8am to 6pm being similarly slow.
- There are almost no differences between the two vendors.
- The heatmap in the lower panel visualises how these trends combine to create a "low-speed-zone" in the middle of the day and week. Based on this, we create a new feature work, which we define as 8am-6pm on Mon-Fri.
If the direct distance is the magnitude of the trip vector then the bearing is its (initial) direction. Easily estimated through the geosphere package, it tells us whether our trip started out for instance in the direction of North-West or South-East. Here we visualise the bearing distribution and its relation to trip_duration, direct distance, and speed:
p1 <- train %>%
filter(dist<1e5) %>%
ggplot(aes(bearing))+
geom_histogram(fill="red",bins=75)+
scale_x_continuous(breaks=seq(-180,180,by=45))+
labs(x="Bearing")
p2 <- train %>%
filter(dist < 1e5) %>%
ggplot(aes(bearing, dist)) +
geom_bin2d(bins = c(100,100)) +
labs(x = "Bearing", y = "Direct distance") +
scale_y_log10() +
theme(legend.position = "none") +
coord_polar() +
scale_x_continuous(breaks = seq(-180, 180, by = 45))
p3 <- train %>%
filter(trip_duration < 3600*22) %>%
filter(dist < 1e5) %>%
ggplot(aes(bearing, trip_duration)) +
geom_bin2d(bins = c(100,100)) +
scale_y_log10() +
labs(x = "Bearing", y = "Trip duration") +
coord_polar() +
scale_x_continuous(breaks = seq(-180, 180, by = 45))
p4 <- train %>%
filter(speed < 75 & dist < 1e5) %>%
ggplot(aes(bearing, speed)) +
geom_bin2d(bins = c(100,100)) +
labs(x = "Bearing", y = "Speed") +
coord_polar() +
scale_x_continuous(breaks = seq(-180, 180, by = 45))
layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 13
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1
This is one of the few cases where polar coordinates might actually be helpful. (Ignore the small discontinuities caused by the binning.)
We find:

- The bearing direction has two prominent peaks around 30 and -150 degrees. Intuitively, those might relate to the orientation of Manhattan. Three or four smaller, less sharp peaks are visible in between those larger peaks.
- The 2-d histograms of distance and trip_duration don't reveal much structure. There are indications for shorter durations and distances at the two peaks, but those might simply be caused by the larger number of observations in these peaks. Faint clusters towards higher trip_durations might be visible near -60 and 125 degrees bearing.
- The bright "rings" we see in the distance and trip_duration reflect the familiar distribution peaks on the logarithmic scale.
- Speed, on the other hand, shows an interestingly clustered structure along the bearing directions. Zones of higher speed can be identified at the aforementioned bearing peaks as well as roughly perpendicular to them (most prominently around 125 degrees). It is possible that we are seeing the Manhattan street grid in this data. Note that here the radial scale is not logarithmic.
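For reference, the initial-bearing (forward azimuth) formula that geosphere computes can be sketched in base R. This is a stand-in illustration, not the notebook's actual call:

```r
# Initial bearing in degrees, mirroring the convention of geosphere::bearing:
# 0 = due north, 90 = due east, range (-180, 180].
bearing_deg <- function(lon1, lat1, lon2, lat2) {
  p1 <- lat1 * pi / 180
  p2 <- lat2 * pi / 180
  dl <- (lon2 - lon1) * pi / 180
  y <- sin(dl) * cos(p2)
  x <- cos(p1) * sin(p2) - sin(p1) * cos(p2) * cos(dl)
  atan2(y, x) * 180 / pi
}

bearing_deg(0, 0, 0, 1)  # due north -> 0
bearing_deg(0, 0, 1, 0)  # due east -> 90
```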
In our maps (above) and trip paths (below) we noticed that a number of trips began or ended at either of the two NYC airports: JFK and La Guardia. Since airports are usually not in the city centre, it is reasonable to assume that the pickup/dropoff distance from the airport could be a useful predictor for longer trip_durations. Above, we defined the coordinates of the two airports and computed the corresponding distances:
p1 <- train %>%
ggplot(aes(jfk_dist_pick))+
geom_histogram(bins=30,fill="red",alpha=0.5)+
scale_x_log10()+
scale_y_sqrt()+
geom_vline(xintercept=2e3)+
labs(x="JFK pickup distance")
p2 <- train %>%
ggplot(aes(jfk_dist_drop)) +
geom_histogram(bins = 30, fill = "blue") +
scale_x_log10() +
scale_y_sqrt() +
geom_vline(xintercept = 2e3) +
labs(x = "JFK dropoff distance")
p3 <- train %>%
ggplot(aes(lg_dist_pick)) +
geom_histogram(bins = 30, fill = "red") +
scale_x_log10() +
scale_y_sqrt() +
geom_vline(xintercept = 2e3) +
labs(x = "La Guardia pickup distance")
p4 <- train %>%
ggplot(aes(lg_dist_drop)) +
geom_histogram(bins = 30, fill = "blue") +
scale_x_log10() +
scale_y_sqrt() +
geom_vline(xintercept = 2e3) +
labs(x = "La Guardia dropoff distance")
layout <- matrix(c(1,2,3,4),2,2,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 14a
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1
Based on these numbers, we can define a JFK/La Guardia trip as having a pickup or dropoff distance of less than 2 km from the corresponding airport.
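As a rough sanity check on that 2 km radius: the two airports are far enough apart that the radii cannot overlap. The coordinates below are approximate values I'm assuming here (not the ones defined earlier in the notebook), and the haversine formula stands in for the distCosine call:

```r
# Assumed approximate airport coordinates (lon, lat)
jfk_coord <- c(-73.7789, 40.6413)
lg_coord  <- c(-73.8740, 40.7769)

# haversine distance in metres (stand-in for geosphere::distCosine)
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dphi <- (lat2 - lat1) * to_rad
  dlam <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlam / 2)^2
  2 * r * asin(sqrt(a))
}

# JFK and La Guardia are roughly 17 km apart, so the 2 km pickup/dropoff
# radii cannot assign a trip to both airports at once
haversine_m(jfk_coord[1], jfk_coord[2], lg_coord[1], lg_coord[2])
```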
What are the trip_durations of these journeys?
p1 <- train %>%
filter(trip_duration < 23*3600) %>%
ggplot(aes(jfk_trip, trip_duration, color = jfk_trip)) +
geom_boxplot() +
scale_y_log10() +
theme(legend.position = "none") +
labs(x = "JFK trip")
p2 <- train %>%
filter(trip_duration < 23*3600) %>%
ggplot(aes(lg_trip, trip_duration, color = lg_trip)) +
geom_boxplot() +
scale_y_log10() +
theme(legend.position = "none") +
labs(x = "La Guardia trip")
layout <- matrix(c(1,2),1,2,byrow=FALSE)
multiplot(p1, p2, layout=layout)
Fig. 14
p1 <- 1; p2 <- 1
We find that our hypothesis was correct and that trips to the airports, in particular the more distant JFK, have significantly longer average trip_durations. These two features should definitely be part of our model.
Before we turn to the modelling, it is time to clean up our training data. We have waited until now to do this so as to have a more complete overview of the problematic observations. The aim here is to remove trips that have improbable features, such as extreme trip durations or very low average speeds.
While there might also be a number of bogus trip durations in the test data, we wouldn't be able to predict them in any case (unless there were some real correlations). By removing these training data values, we will make our model more robust and more likely to generalise to unseen data, which is always our primary goal in machine learning.
Let's visualise the distances of the trips that took a day or longer. Unless someone took a taxi from NYC to LA, it is unlikely that those values are accurate. Here we make use of the maps package to draw an outline of Manhattan, where most of the trips begin or end. We then overlay the pickup coordinates in red and the dropoff coordinates in blue.
To further add the trip connections (direct distance) we use another handy tool from the geosphere package: gcIntermediate allows us to interpolate the path between two sets of coordinates. I first saw this tool in action in this truly outstanding kernel by Jonathan Bouchet. (If you haven't seen his kernel, then I strongly recommend you check it out; right after reading this one, of course ;-) ).
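For the short distances within NYC, a simple linear interpolation of the coordinates is a close stand-in for the great-circle path that gcIntermediate draws; a base-R sketch (the function name and coordinates are illustrative assumptions):

```r
# Linear lon/lat interpolation between two points; for trips within NYC this
# closely approximates the great-circle path of geosphere::gcIntermediate.
interp_path <- function(p_start, p_end, n = 30) {
  t <- seq(0, 1, length.out = n + 2)  # n intermediate points plus both endpoints
  data.frame(lon = p_start[1] + t * (p_end[1] - p_start[1]),
             lat = p_start[2] + t * (p_end[2] - p_start[2]))
}

path <- interp_path(c(-73.99, 40.75), c(-73.78, 40.64))
nrow(path)  # 32 rows: 30 intermediate points plus start and end
```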
We start with the few trips that pretend to have taken several days to complete:
days_plus_trips <- train %>%
filter(trip_duration>24*3600)
days_plus_trips %>%
select(pickup_datetime,dropoff_datetime,speed)
## # A tibble: 4 x 3
## pickup_datetime dropoff_datetime speed
## <dttm> <dttm> <dbl>
## 1 2016-01-05 00:19:42 2016-01-27 11:08:38 0.0374
## 2 2016-02-13 22:38:00 2016-03-08 15:57:38 0.0105
## 3 2016-01-05 06:14:15 2016-01-31 01:01:07 0.00265
## 4 2016-02-13 22:46:52 2016-03-25 18:18:14 0.0203
ny_map <- as.tibble(map_data("state",region="new york:manhattan"))
tpick <- days_plus_trips %>%
select(lon=pickup_longitude,lat=pickup_latitude)
tdrop <- days_plus_trips %>%
select(lon = dropoff_longitude, lat = dropoff_latitude)
p1 <- ggplot()+
geom_polygon(data=ny_map,aes(x=long,y=lat),fill="grey60")+
geom_point(data=tpick,aes(x=lon,y=lat),size=1,color="red",alpha=1)+
geom_point(data=tdrop,aes(x=lon,y=lat),size=1,color='blue',alpha=1)
for (i in seq(1,nrow(tpick))){
inter <- as.tibble(gcIntermediate(tpick[i,],tdrop[i,],n=30,addStartEnd = T))
p1 <- p1+geom_line(data=inter,aes(x=lon,y=lat),color="blue",alpha=.25)
}
p1 + ggtitle("Longer than a day trips in relation to Manhattan")
Fig. 15
p1 <- 1
We find nothing out of the ordinary about these routes themselves. While a trip to JFK can seem like an eternity if your flight is boarding soon, it is unlikely to take this long in real time. The average taxi speeds don't look very likely either.
Decision: These values should be removed from the training data set for continued exploration and modelling.
Call me crazy, but I don't think it is inconceivable that someone takes a taxi for a trip that lasts almost a day (with breaks, of course). On very rare occasions this might happen; provided that the distance travelled was sufficiently long.
Here we define day-long trips as taking between 22 and 24 hours, which covers a small peak in our raw trip_duration distribution. These are the top 5 direct distances (in m) among the day-long trips:
day_trips <- train %>%
filter(trip_duration<24*3600 & trip_duration>22*3600)
day_trips %>%
arrange(desc(dist)) %>%
select(dist,pickup_datetime,dropoff_datetime,speed) %>%
head(5)
## # A tibble: 5 x 4
## dist pickup_datetime dropoff_datetime speed
## <dbl> <dttm> <dttm> <dbl>
## 1 60666. 2016-06-04 13:54:29 2016-06-05 13:40:30 2.55
## 2 42401. 2016-01-28 21:43:02 2016-01-29 21:33:30 1.78
## 3 28116. 2016-03-14 22:46:05 2016-03-15 22:24:27 1.19
## 4 22808. 2016-06-18 17:41:47 2016-06-19 16:30:41 1.000
## 5 22759. 2016-06-01 19:52:42 2016-06-02 19:35:09 0.960
The top one is about 60km (about 37 miles), which is not particularly far. The average speed wouldn’t suggest a generous tip, either. What do these trips look like on the map?
ny_map <- as.tibble(map_data("state", region="new york:manhattan"))
set.seed(2017)
day_trips <- day_trips %>%
sample_n(200)
tpick <- day_trips %>%
select(lon=pickup_longitude,lat=pickup_latitude)
tdrop <- day_trips %>%
select(lon=dropoff_longitude,lat=dropoff_latitude)
p1 <- ggplot()+
geom_polygon(data=ny_map,aes(x=long,y=lat),fill="grey60")+
geom_point(data=tpick,aes(x=lon,y=lat),size=1,color="red",alpha=1)+
geom_point(data=tdrop,aes(x=lon,y=lat),size=1,color="blue",alpha=1)
for (i in seq(1,nrow(tpick))){
inter <- as.tibble(gcIntermediate(tpick[i,],tdrop[i,],n=30,addStartEnd = T))
p1 <- p1+geom_line(data=inter,aes(x=lon,y=lat),color="blue",alpha=.25)
}
p1 + ggtitle("Day-long trips in relation to Manhattan")
Fig. 16
p1 <- 1
Here we are plotting only 200 of the roughly 1800 connections to keep the map reasonably readable and the script fast. Pickup points are red and dropoff points are blue.
We find:
Decision: We will remove trip_durations longer than 22 hours from the exploration and possibly from the modelling.
On the other side of the trip_duration distribution we have those rides that appear to only have lasted for a couple of minutes. While such short trips are entirely possible, let’s check their durations and speeds to make sure that they are realistic.
min_trips <- train %>%
filter(trip_duration<5*60)
min_trips %>%
arrange(dist) %>%
select(dist,pickup_datetime,dropoff_datetime,speed) %>%
head(5)
## # A tibble: 5 x 4
## dist pickup_datetime dropoff_datetime speed
## <dbl> <dttm> <dttm> <dbl>
## 1 0 2016-02-29 18:39:12 2016-02-29 18:42:59 0
## 2 0 2016-01-27 22:29:31 2016-01-27 22:29:58 0
## 3 0 2016-01-22 16:13:01 2016-01-22 16:13:20 0
## 4 0 2016-01-18 15:24:43 2016-01-18 15:28:57 0
## 5 0 2016-05-04 22:28:43 2016-05-04 22:32:51 0
In doing so, we notice that there are a relatively large number of zero-distance trips:
zero_dist <- train %>%
filter(near(dist,0))
nrow(zero_dist)
## [1] 5897
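The near() call above compares against zero with a floating-point tolerance rather than with ==; a base-R equivalent, assuming dplyr's default tolerance:

```r
# Base-R equivalent of the dplyr::near(dist, 0) comparison: equality within a
# floating-point tolerance (dplyr's default is .Machine$double.eps^0.5).
near_zero <- function(x, tol = .Machine$double.eps^0.5) abs(x) < tol

near_zero(0)      # TRUE
near_zero(1e-12)  # TRUE: floating-point noise still counts as zero distance
near_zero(0.5)    # FALSE
```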
What are their nominal top durations?
zero_dist %>%
arrange(desc(trip_duration)) %>%
select(trip_duration,pickup_datetime,dropoff_datetime,vendor_id) %>%
head(5)
## # A tibble: 5 x 4
## trip_duration pickup_datetime dropoff_datetime vendor_id
## <int> <dttm> <dttm> <fct>
## 1 86352 2016-06-05 01:09:39 2016-06-06 01:08:51 2
## 2 85333 2016-01-01 15:27:28 2016-01-02 15:09:41 2
## 3 78288 2016-05-18 13:40:45 2016-05-19 11:25:33 2
## 4 5929 2016-06-08 16:47:44 2016-06-08 18:26:33 2
## 5 4683 2016-05-25 17:36:49 2016-05-25 18:54:52 2
There really are a few taxis where the data wants to tell us that they did not move at all for about a day, while carrying a passenger. We choose not to believe the data in this case.
Once we remove the extreme cases, this is what the distribution looks like:
zero_dist %>%
filter(trip_duration<6000) %>%
ggplot(aes(trip_duration,fill=vendor_id))+
geom_histogram(bins=50)+
scale_x_log10()
Fig. 17
zero_dist %>%
filter(trip_duration<6000) %>%
ggplot(aes(trip_duration,fill=vendor_id))+
geom_histogram(bins=50)
Fig. 17
We find:

- trip_durations of about a minute might still be somehow possible, assuming that someone got into a taxi but then changed their mind before the taxi could move. Whether that should count as a "trip" is a different question. But trip durations of about 15 minutes (900 s) without any distance covered seem hardly possible. Unless they involve traffic jams, but who gets into a taxi that's stuck in a traffic jam?
- It is also noteworthy that most trips in the less-than-a-minute group were from vendor 1, whereas the 10-minute group predominantly consists of vendor 2 taxis.
Decision: We will remove those zero-distance trips that took more than a minute for our continued analysis. Removing them from the modelling might be detrimental if there are similar trips in the test sample.
After removing the zero-distance trips, those are the short rides with the highest average speed:
min_trips <- train %>%
filter(trip_duration<5*60 & dist>0)
min_trips %>%
arrange(desc(speed)) %>%
select(trip_duration,dist,pickup_datetime,speed) %>%
head(10)
## # A tibble: 10 x 4
## trip_duration dist pickup_datetime speed
## <int> <dbl> <dttm> <dbl>
## 1 7 18055. 2016-02-13 20:28:30 9285.
## 2 282 320484. 2016-04-02 20:33:19 4091.
## 3 2 784. 2016-06-12 06:35:13 1412.
## 4 51 19970. 2016-06-19 20:18:35 1410.
## 5 279 104877. 2016-03-20 21:07:56 1353.
## 6 2 704. 2016-06-23 13:36:48 1267.
## 7 20 6919. 2016-05-28 15:14:19 1245.
## 8 3 914. 2016-05-30 17:12:12 1097.
## 9 4 1031. 2016-01-17 03:11:56 928.
## 10 3 762. 2016-06-13 16:34:30 914.
Clearly, the top speed values are impossible. We include a map plot of the general distribution of distances within the sample:
ny_map <- as.tibble(map_data("state",region="new york:manhattan"))
set.seed(1234)
foo <- min_trips %>%
sample_n(600)
tpick <- foo %>%
select(lon=pickup_longitude,lat=pickup_latitude)
tdrop <- foo %>%
select(lon=dropoff_longitude,lat=dropoff_latitude)
p1 <- ggplot()+
geom_polygon(data=ny_map,aes(x=long,y=lat),fill="grey60")+
geom_point(data=tpick,aes(x=lon,y=lat),size=1,color="red",alpha=1)+
geom_point(data=tdrop,aes(x=lon,y=lat),size=1,color="blue",alpha=1)
for (i in seq(1,nrow(tpick))){
inter <- as.tibble(gcIntermediate(tpick[i,],tdrop[i,],n=30,addStartEnd = T))
p1 <- p1+geom_line(data=inter,aes(x=lon,y=lat),color="blue",alpha=.25)
}
p1+ggtitle("Minute-long trips in relation to Manhattan")
Fig. 17b
p1 <- 1
We find (from the hidden plot):

- Most distances are in fact short, which means that combined with setting an average speed limit we should be able to remove those values that are way beyond being realistic. This should also get rid of many trips that appear to have durations of seconds only.
Decision: We impose a lower trip_duration limit of 10 seconds and a (very conservative) speed limit of 100 km/h (62 mph). (Remember that this refers to the direct distance.)
Every data set has a few entries that are just flat-out ridiculous. Here are the best ones from this one, with pickup or dropoff locations more than 300 km away from NYC (JFK airport):
long_dist <- train %>%
filter( (jfk_dist_pick>3e5) | (jfk_dist_drop>3e5))
long_dist_coord <- long_dist %>%
select(lon=pickup_longitude,lat=pickup_latitude)
long_dist %>%
select(id, jfk_dist_pick, jfk_dist_drop, dist, trip_duration, speed) %>%
arrange(desc(jfk_dist_pick))
## # A tibble: 31 x 6
## id jfk_dist_pick jfk_dist_drop dist trip_duration speed
## <chr> <dbl> <dbl> <dbl> <int> <dbl>
## 1 id2854272 4128727. 4128719. 14.8 499 0.107
## 2 id3777240 4128721. 4128726. 21.8 1105 0.0711
## 3 id2306955 1253574. 21484. 1242299. 792 5647.
## 4 id1974018 1115646. 1115646. 0 369 0
## 5 id0267429 988748. 988748. 0 961 0
## 6 id0838705 695767. 481141. 215468. 1131 686.
## 7 id0205460 684133. 684133. 0 329 0
## 8 id0978162 674251. 941618. 315117. 875 1296.
## 9 id3525158 665877. 665877. 0 385 0
## 10 id1510552 642663. 472020. 892212. 611 5257.
## # ... with 21 more rows
Many zero-distance trips with more than a minute duration, which we would remove anyway. But just out of curiosity, where did they happen? (We will again use the amazing leaflet package, this time with individual markers that give us id and direct distance information for mouse-over and click actions.)
leaflet(long_dist_coord) %>%
addTiles() %>%
setView(-92.00,41.0,zoom=4) %>%
addMarkers(popup = ~as.character(long_dist$dist),label=~as.character(long_dist$id))
Fig. 18
See what I mean?
Not only are there two (!) lone NYC taxis near San Francisco, but 9 others are actually ocean-going taxis, who knew ;-). Most of the others will be removed by our previous cleaning limits.
**These long-distance locations represent outliers that should be removed to improve the robustness of predictive models.**
Here we apply the cleaning filters discussed above. This code block is likely to expand as the analysis progresses.
train <- train %>%
filter(trip_duration<22*3600,
dist>0 | (near(dist,0) & trip_duration<60),
jfk_dist_pick<3e5, jfk_dist_drop<3e5,
trip_duration>10,
speed<100)
In this playground competition, we have been encouraged to supplement our analysis with additional data sources. These are published as Kaggle data sets and are being collected in this discussion thread.
In fact, there are monetary prizes for publishing the top 4 data sets, which together with the Kernel awards gives this competition a fresh and interesting spin.
If you want to know more about how to include multiple data sources in your Kaggle kernel then check out this article.
We start by incorporating a list of NYC weather data provided by Mathijs Waegemakers right here. In the next paragraph I copy the data description verbatim:
Weather data collected from the National Weather Service. It contains the first six months of 2016, for a weather station in central park. It contains for each day the minimum temperature, maximum temperature, average temperature, precipitation, new snow fall, and current snow depth. The temperature is measured in Fahrenheit and the depth is measured in inches. T means that there is a trace of precipitation.
Of particular interest here will be the rain and snow fall statistics, which I’m curious to compare to the dip in trip numbers in late January. Check also this kernel which presents an independent analysis of the same data set.
The data can be found in the data set sub-directory of the ../input directory and it looks like this:
weather <- as.tibble(fread("C:/Users/kojikm.mizumura/Desktop/Data Science/Kaggle/weather_data_nyc_centralpark_2016(1).csv"))
glimpse(weather)
## Observations: 366
## Variables: 7
## $ date <chr> "1-1-2016", "2-1-2016", "3-1-2016", "4-1...
## $ `maximum temperature` <int> 42, 40, 45, 36, 29, 41, 46, 46, 47, 59, ...
## $ `minimum temperature` <int> 34, 32, 35, 14, 11, 25, 31, 31, 40, 40, ...
## $ `average temperature` <dbl> 38.0, 36.0, 40.0, 25.0, 20.0, 33.0, 38.5...
## $ precipitation <chr> "0.00", "0.00", "0.00", "0.00", "0.00", ...
## $ `snow fall` <chr> "0.0", "0.0", "0.0", "0.0", "0.0", "0.0"...
## $ `snow depth` <chr> "0", "0", "0", "0", "0", "0", "0", "0", ...
We turn the date into a lubridate object and convert the traces (“T”) of rain and snow into small numeric amounts. We also save the maximum and minimum temperature in a shorter form:
weather <- weather %>%
mutate(date=dmy(date),
rain=as.numeric(ifelse(precipitation=="T","0.01",precipitation)),
s_fall = as.numeric(ifelse(`snow fall` == "T", "0.01", `snow fall`)),
s_depth = as.numeric(ifelse(`snow depth` == "T", "0.01", `snow depth`)),
all_precip=s_fall+rain,
has_snow=(s_fall>0) | (s_depth>0),
has_rain=rain>0,
max_temp=`maximum temperature`,
min_temp = `minimum temperature`
)
Then we join this information to our training data using the common date column:
foo <- weather %>%
select(date,rain,s_fall,all_precip,has_snow,has_rain,s_depth,max_temp,min_temp)
train <- left_join(train, foo, by = "date")
Let's compare the snow fall statistics to our trip numbers per day:
p1 <- train %>%
group_by(date) %>%
count() %>%
ggplot(aes(date,n/1e3))+
geom_line(size=1.5,color="red")+
labs(x="",y="Kilo trips per day")
p2 <- train %>%
group_by(date) %>%
summarise(trips=n(),
snow_fall=mean(s_fall),
rain_fall=mean(rain),
all_precip=mean(all_precip)) %>%
ggplot(aes(date,snow_fall))+
geom_line(color="blue",size=1.5)+
labs(x="",y="snowfall")+
scale_y_sqrt()+
scale_x_date(limits=ymd(c("2015-12-28", "2016-06-30")))
p3 <- train %>%
group_by(date) %>%
summarise(trips=n(),
snow_depth=mean(s_depth)) %>%
ggplot(aes(date,snow_depth))+
geom_line(color="purple",size=1.5)
p4 <- train %>%
group_by(date) %>%
summarise(median_speed=median(speed)) %>%
ggplot(aes(date,median_speed))+
geom_line(color="orange",size=1.5)+
labs(x="Date",y="Median speed")
layout <- matrix(c(1,2,3,4),4,1,byrow=FALSE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 19
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1
I knew it! The dip in trip volume corresponds to the largest (and first?) snow fall of the winter in NYC (Jan 23rd). In fact, NYC was hit by a blizzard and experienced record-breaking snowfall. The impact on the traffic patterns was enormous, and the median speed slowed down notably. (Note the square-root y-axis in the two middle plots.)
After this large spike, we clearly see that the following periods of snow fall, albeit significantly less heavy, have nowhere near as large an effect on the number of taxi trips or their velocities. A possible exception might have been the last snow in mid March, which might have caught people by surprise after it hadn't snowed for a while.
Now, do lower numbers of trips in snowy weather also lead to longer journeys? To answer this question we look at a scatter plot between the average trip duration and the total precipitation (rain + snow) for all days in our sample:
train %>%
group_by(date,has_snow) %>%
summarise(duration=mean(trip_duration),
all_precip=mean(all_precip)) %>%
ggplot(aes(all_precip,duration,color=has_snow))+
geom_jitter(width=0.04,size=2)+
scale_x_sqrt()+
scale_y_log10()+
labs(x = "Amount of total precipitation", y = "Average trip duration")
Fig. 20
We find that, except for one snow day with long-duration trips, it seems more as if snow leads to shorter trips. However, this could simply mean that passengers were more likely to travel shorter distances, so how does the average speed compare?
p1 <- train %>%
filter(speed < 50) %>%
ggplot(aes(has_snow, speed, color = has_snow)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(x = "Snowfall")
p2 <- train %>%
filter(speed < 50) %>%
ggplot(aes(has_rain, speed, color = has_rain)) +
geom_boxplot() +
theme(legend.position = "none") +
labs(x = "Rainfall")
layout <- matrix(c(1,2),1,2,byrow=FALSE)
multiplot(p1, p2, layout=layout)
Fig. 21
p1 <- 1; p2 <- 1
We find that there is no significant difference here. Moreover, Jan 23rd (the blizzard) is not at the top of the list of snow days with the longest median trip_duration (it's in the top 10, though).
train %>%
filter(has_snow == TRUE) %>%
group_by(date) %>%
summarise(med_duration = median(trip_duration),
med_speed = median(speed)) %>%
arrange(desc(med_duration)) %>%
head(10)
## # A tibble: 10 x 3
## date med_duration med_speed
## <date> <dbl> <dbl>
## 1 2016-01-26 824 10.6
## 2 2016-01-25 805 10.8
## 3 2016-01-27 752 10.9
## 4 2016-01-28 732 11.5
## 5 2016-01-29 705 11.4
## 6 2016-02-11 670 11.7
## 7 2016-03-04 663 12.0
## 8 2016-02-23 662 11.4
## 9 2016-01-23 660 12.4
## 10 2016-01-19 656 12.1
We can spot a different interesting pattern in this table, though, when looking at the slowest snow days. Interestingly, the overall lowest median speeds were observed in the days following the blizzard, from Jan 25th to 27th (although note that the day immediately after the blizzard, Jan 24th, was a Sunday):
train %>%
group_by(date) %>%
summarise(med_duration=median(trip_duration),
med_speed=median(speed)) %>%
arrange(med_speed) %>%
head()
## # A tibble: 6 x 3
## date med_duration med_speed
## <date> <dbl> <dbl>
## 1 2016-01-26 824 10.6
## 2 2016-01-25 805 10.8
## 3 2016-01-27 752 10.9
## 4 2016-06-08 768 10.9
## 5 2016-06-01 761 11.2
## 6 2016-05-26 747 11.3
Conclusion: From an exploratory point of view this doesn't look too promising. Still, I'm inclined to include the rain and snow fall occurrence in a tentative model to see whether it shows any interaction terms that we can't see at the moment. The "blizzard effect" on the (working) week after the heavy snow fall is probably worth taking into account separately. Thus, we create a blizzard feature in our engineering code block.
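That blizzard flag boils down to a simple date-window test; a base-R sketch equivalent to the lubridate expression in the feature-engineering block (Jan 22-29, 2016, inclusive):

```r
# Base-R equivalent of the blizzard flag: TRUE for dates in the
# Jan 22-29, 2016 window used in the feature-engineering block.
is_blizzard_week <- function(date) {
  date >= as.Date("2016-01-22") & date <= as.Date("2016-01-29")
}

is_blizzard_week(as.Date("2016-01-25"))  # TRUE
is_blizzard_week(as.Date("2016-02-05"))  # FALSE
```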
Another interesting external data set is an estimate of the fastest routes for each trip, provided by oscarleo using the Open Source Routing Machine (OSRM). The data can be found here and includes the pickup/dropoff streets and total distance/duration between these two points, together with a sequence of travel steps such as turns or entering a highway.
Note that, according to discussion items, those really are the "fastest", not the shortest, routes between the two points. This will most likely not account for traffic volume, and the fastest route on one day might not be the fastest on another day of the week. Still, here we have an actual driving distance and duration measure that should be valuable for our prediction goal.
The data comes in three separate files, split into the train (2 files) vs test trip IDs. Here, we load and combine all of them. Note that the combined data set is considerably larger than our original training data.
foo <- as.tibble(fread("C:/Users/kojikm.mizumura/Desktop/Data Science/Kaggle/fastest_routes_train_part_1.csv"))
bar <- as.tibble(fread("C:/Users/kojikm.mizumura/Desktop/Data Science/Kaggle/fastest_routes_train_part_2.csv"))
foobar <- as.tibble(fread("C:/Users/kojikm.mizumura/Desktop/Data Science/Kaggle/fastest_routes_test.csv"))
fastest_route <- bind_rows(foo,bar,foobar)
glimpse(fastest_route)
## Observations: 2,083,777
## Variables: 12
## $ id <chr> "id2875421", "id2377394", "id3504673", "i...
## $ starting_street <chr> "Columbus Circle", "2nd Avenue", "Greenwi...
## $ end_street <chr> "East 65th Street", "Washington Square We...
## $ total_distance <dbl> 2009.1, 2513.2, 1779.4, 1614.9, 1393.5, 1...
## $ total_travel_time <dbl> 164.9, 332.0, 235.8, 140.1, 189.4, 138.8,...
## $ number_of_steps <int> 5, 6, 4, 5, 5, 5, 2, 13, 6, 9, 4, 4, 10, ...
## $ street_for_each_step <chr> "Columbus Circle|Central Park West|65th S...
## $ distance_per_step <chr> "0|576.4|885.6|547.1|0", "877.3|836.5|496...
## $ travel_time_per_step <chr> "0|61.1|60.1|43.7|0", "111.7|109|69.9|25....
## $ step_maneuvers <chr> "depart|rotary|turn|new name|arrive", "de...
## $ step_direction <chr> "left|straight|right|straight|arrive", "n...
## $ step_location_list <chr> "-73.982316,40.767869|-73.981997,40.76768...
The glimpse view shows us that indeed for each step the streets are named (street_for_each_step), alongside a detailed account of the distance_per_step, travel_time_per_step, and the step_maneuvers and step_direction describing the journey. The step_maneuvers are named in a pretty self-explanatory way and are also described in the data set overview. Depart and arrive are individual steps. The direction of a step_maneuver "turn" is listed in step_direction. The total distance is measured in metres and the duration in seconds, like in the original data.
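These per-step columns are pipe-separated strings, so they can be unpacked with plain strsplit. A small sketch using the first distance_per_step value shown in the glimpse output above:

```r
# Unpack one pipe-separated distance_per_step value into numeric step distances
steps <- "0|576.4|885.6|547.1|0"  # first row from the glimpse output
step_dist <- as.numeric(strsplit(steps, "|", fixed = TRUE)[[1]])
step_dist       # 0, 576.4, 885.6, 547.1, 0
sum(step_dist)  # ~2009.1, recovering that row's total_distance
```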
Before joining the “fastest routes” data to our training set, let’s first visualise the distributions of the numerical parameters:
p1 <- fastest_route %>%
ggplot(aes(total_travel_time))+
geom_histogram(bins=150,fill="red")+
scale_x_log10()+
scale_y_sqrt()
p2 <- fastest_route %>%
ggplot(aes(total_distance))+
geom_histogram(bins = 150, fill = "red") +
scale_x_log10() +
scale_y_sqrt()
p3 <- fastest_route %>%
ggplot(aes(number_of_steps)) +
geom_bar(fill = "red")
p4 <- fastest_route %>%
ggplot(aes(total_distance, total_travel_time)) +
geom_bin2d(bins = c(80,80))
layout <- matrix(c(1,2,3,4,4,4),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 22
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1
We find:

- The total_travel_time and total_distance peak at values of around 300 s (5 min) and 2800 m, respectively. These are the median values. Really high values are rare in both features; however, there are indications for secondary peaks towards longer times and distances. Very short distances (a few metres) and travel times (a few seconds) are present in the data.
- The single most frequent number_of_steps is 5. Since "depart" and "arrive" are always present, there are no trips with fewer than 2 steps. There are more than 80k trips that only contain these two steps, which would mean that they would have at most travelled in a straight line, as far as I can see:
fastest_route %>% filter(number_of_steps==2) %>% nrow()
## [1] 83340
fastest_route %>%
ggplot(aes(factor(number_of_steps),total_travel_time,color=factor(number_of_steps)))+
geom_boxplot()+
labs(x="number_of_steps")+
theme(legend.position="none")
Fig. 23
We also find that the median travel time correlates well with the number_of_steps, but that there is also much overlap between neighbouring step counts, along with numerous outliers predominantly towards longer durations.
The distributions of total_distance per number_of_steps look very similar to the plot above.
In a first look at the joined data, we will mainly focus on the total_distance and total_travel_time, when combined with our original training data set:
foo <- fastest_route %>%
select(id, total_distance, total_travel_time, number_of_steps,
step_direction, step_maneuvers) %>%
mutate(fastest_speed = total_distance/total_travel_time*3.6,
left_turns = str_count(step_direction, "left"),
right_turns = str_count(step_direction, "right"),
turns = str_count(step_maneuvers, "turn")
) %>%
select(-step_direction, -step_maneuvers)
train <- left_join(train, foo, by = "id") %>%
mutate(fast_speed_trip = total_distance/trip_duration*3.6)
In addition, we also create the following features:

- fastest_speed is the average speed of the fastest route based on the OSRM values, and fast_speed_trip is the speed assuming the fastest route but the actual trip duration. Both features are in units of km/h.
- left_turns, right_turns, and turns are the total number of those maneuvers per route. The step_directions “left” and “right” might not always be associated with the step_maneuver “turn”, but in most cases they are, so we use them as a first estimate of the number of left and right turns. Left turns in particular might slow down travel.
Beyond that, the detailed list of streets used and their respective travel_time_per_step, for instance, might allow us to take known areas of slow traffic into account.
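As a quick illustration of how str_count derives these turn counts, here is a toy step_direction string (the value is made up for this sketch; the real strings come from the OSRM files):

```r
library(stringr)

# Hypothetical step_direction value for a single route
steps <- "left,straight,right,left,arrive"
str_count(steps, "left")   # 2
str_count(steps, "right")  # 1
```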
p1 <- train %>%
ggplot(aes(trip_duration)) +
geom_density(fill = "red", alpha = 0.5) +
geom_density(aes(total_travel_time), fill = "blue", alpha = 0.5) +
scale_x_log10() +
coord_cartesian(xlim = c(5e1, 8e3))
p2 <- train %>%
ggplot(aes(dist)) +
geom_density(fill = "red", alpha = 0.5) +
geom_density(aes(total_distance), fill = "blue", alpha = 0.5) +
scale_x_log10() +
coord_cartesian(xlim = c(2e2, 5e4))
p3 <- train %>%
ggplot(aes(trip_duration, total_travel_time)) +
geom_bin2d(bins = c(120,120)) +
geom_abline(intercept = 0, slope = 1, colour = "red") +
theme(legend.position = "none")
p4 <- train %>%
ggplot(aes(dist, total_distance)) +
geom_bin2d(bins = c(70,70)) +
geom_abline(intercept = 0, slope = 1, colour = "red") +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 24
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1
Here the fastest features are in blue and the original trip_duration and direct distance in red. In the binned scatter plots the red line has slope 1.
We find:
What about the speed? Comparing our original speed estimate to the fastest_speed from the OSRM data should give us an idea of how realistic the values of the former feature are.
For this figure, the fastest values in the kernel-density plot are in blue, and the red line in the binned scatter plot has a slope of one:
p1 <- train %>%
ggplot(aes(speed))+
geom_density(fill="red",alpha=0.5)+
geom_density(aes(fastest_speed),fill="blue",alpha=0.5)+
coord_cartesian(xlim=c(0,80))
p2 <- train %>%
ggplot(aes(speed,fastest_speed))+
geom_bin2d(bins=c(50,50))+
geom_abline(intercept=0,slope=1,colour="red")+
theme(legend.position="none")
layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 24b
p1 <- 1; p2 <- 1
Here we find several effects:
At face value, the impact of the number of turns on the total_travel_time is very similar for left_turns and right_turns. A main reason behind this is likely that the total_travel_time increases with an increasing number_of_steps of any kind, and that the number of turns is of course correlated with the number_of_steps. We will learn more from the percentage of turns per trip, which we will investigate in a moment. From these plots we find the following results:
p1 <- train %>%
ggplot(aes(factor(left_turns),total_travel_time,color=factor(left_turns)))+
geom_boxplot()+
labs(x="Number of left turns")+
theme(legend.position="none")
p2 <- train %>%
ggplot(aes(factor(right_turns), total_travel_time, color = factor(right_turns))) +
geom_boxplot() +
labs(x = "Number of right turns") +
theme(legend.position = "none")
p3 <- train %>%
ggplot(aes(factor(turns), total_travel_time, color = factor(turns))) +
geom_boxplot() +
labs(x = "Total number of turns") +
theme(legend.position = "none")
layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)
Fig. 24c
p1 <- 1; p2 <- 1; p3 <- 1
Now let’s move on to the turn percentages:
foo <- train %>%
filter(!is.na(left_turns)) %>%
mutate(left_frac = left_turns/number_of_steps,
right_frac = right_turns/number_of_steps)
p1 <- foo %>%
ggplot(aes(left_frac)) +
geom_histogram(bins = 20, fill = "red") +
labs(x = "Left turns / all steps")
p2 <- foo %>%
ggplot(aes(right_frac)) +
geom_histogram(bins = 20, fill = "red") +
labs(x = "Right turns / all steps")
p3 <- foo %>%
ggplot(aes(left_frac, total_travel_time)) +
geom_bin2d(bins = c(40,40)) +
theme(legend.position = "none") +
labs(x = "Left turns / all steps")
p4 <- foo %>%
ggplot(aes(right_frac, total_travel_time)) +
geom_bin2d(bins = c(40,40)) +
theme(legend.position = "none") +
labs(x = "Right turns / all steps")
p5 <- foo %>%
ggplot(aes(left_frac, fastest_speed)) +
geom_bin2d(bins = c(40,40)) +
theme(legend.position = "none") +
labs(x = "Left turns / all steps")
p6 <- foo %>%
ggplot(aes(right_frac, fastest_speed)) +
geom_bin2d(bins = c(40,40)) +
theme(legend.position = "none") +
labs(x = "Right turns / all steps")
layout <- matrix(c(1,2,3,4,5,6),3,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, p6, layout=layout)
Fig. 25
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1; foo <- 1
We find:
In this section we will compare the OSRM total_travel_time to the taxi trip_duration in our original data. We will specifically focus on understanding those curious cases that suggest significant delays in our data.
Here we show a scatter plot of the two timing features, colour-coded by the percentage of left turns in the OSRM route (the solid red line again has a slope of 1):
train %>%
ggplot(aes(trip_duration, total_travel_time, color = left_turns / number_of_steps)) +
geom_point(alpha = 0.3) +
geom_abline(intercept = 0, slope = 1, colour = "red") +
scale_colour_gradient(low = "blue", high = "red") +
scale_x_log10() +
scale_y_log10() +
geom_hline(yintercept = c(50,2000), linetype = 2) +
labs(color = "Left turns [%]")
Fig. 26
Note the double-log axes.
We find:

- There is a population of data points where the trip_duration (the actual timing of the taxi ride) is lower than the total_travel_time based on the fastest OSRM route. These differences can be significant for trips that take less than 1000 s (i.e. about 17 min) on the fastest route, but become less severe for rides longer than that.
- There are no “fastest” routes with more than 10 ks total_travel_time. The maximum in the cleaned sample is at about 5000 s, or roughly 1.4 h. Almost all of the significant long-delay outliers in trip_duration (indicated by the horizontal dashed lines) lie in the range of 100 - 2000 s total_travel_time and extend beyond 10 ks trip_duration.
- There appears to be a tendency for longer trips (especially in total_travel_time) to contain a higher fraction of left_turns. However, this apparent effect might be due to the shortest rides containing no turns while longer rides (above 1 min) include an average number of turns. In any case, within the long-delay outliers there is no obvious impact of the percentage of left_turns.
The most prominent outliers only make up a very small percentage of the overall rides:
train %>%
filter(total_travel_time>50, total_travel_time<2000) %>%
mutate(delay=trip_duration>1e4) %>%
group_by(delay) %>%
count()
## # A tibble: 2 x 2
## # Groups: delay [2]
## delay n
## <lgl> <int>
## 1 FALSE 1434779
## 2 TRUE 207
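To put a number on “very small percentage”, using the counts from the table above:

```r
# Share of long-delay outliers among trips with 50 < total_travel_time < 2000
n_delay  <- 207
n_normal <- 1434779
100 * n_delay / (n_delay + n_normal)   # about 0.014% of these trips
```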
We will briefly examine those data points using a comparison of relative frequencies in bar plots and density overlays:
foo <- train %>%
filter(total_travel_time>50, total_travel_time<2000) %>%
mutate(delay=trip_duration>1e4)
head(foo)
## # A tibble: 6 x 43
## id vendor_id pickup_datetime dropoff_datetime passenger_count
## <chr> <fct> <dttm> <dttm> <fct>
## 1 id28~ 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1
## 2 id23~ 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1
## 3 id38~ 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1
## 4 id35~ 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1
## 5 id21~ 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1
## 6 id08~ 2 2016-01-30 22:01:40 2016-01-30 22:09:03 6
## # ... with 38 more variables: pickup_longitude <dbl>,
## # pickup_latitude <dbl>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>,
## # dist <dbl>, bearing <dbl>, jfk_dist_pick <dbl>, jfk_dist_drop <dbl>,
## # lg_dist_pick <dbl>, lg_dist_drop <dbl>, speed <dbl>, date <date>,
## # month <ord>, wday <ord>, hour <int>, work <lgl>, jfk_trip <lgl>,
## # lg_trip <lgl>, blizzard <lgl>, rain <dbl>, s_fall <dbl>,
## # all_precip <dbl>, has_snow <lgl>, has_rain <lgl>, s_depth <dbl>,
## # max_temp <int>, min_temp <int>, total_distance <dbl>,
## # total_travel_time <dbl>, number_of_steps <int>, fastest_speed <dbl>,
## # left_turns <int>, right_turns <int>, turns <int>,
## # fast_speed_trip <dbl>, delay <lgl>
p1 <- foo %>%
ggplot(aes(vendor_id, group = delay)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~delay) +
theme(legend.position = "none")
p2 <- foo %>%
ggplot(aes(jfk_trip, group = delay)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~delay) +
theme(legend.position = "none")
p3 <- foo %>%
ggplot(aes(lg_trip, group = delay)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~delay) +
theme(legend.position = "none")
p4 <- foo %>%
ggplot(aes(work, group = delay)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~delay) +
theme(legend.position = "none")
p5 <- foo %>%
ggplot(aes(total_distance, fill = delay)) +
geom_density(alpha = 0.5)
layout <- matrix(c(1,2,3,4,5,5),2,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1
Here the facets (TRUE vs FALSE) refer to whether the observations are part of the long-delay group or not.
We find:

- Almost all long delays happen for the taxis of vendor 2, even though vendor 2 has only a slightly higher percentage of rides overall. This discrepancy is definitely significant enough to explain most of the effect.
- Another interesting fact is that in the long-delay sample the fraction of JFK trips is significantly higher, and the fraction of La Guardia trips slightly higher, than in the general population. We could hypothesise that for an airport journey the trip_duration might include the waiting time while queuing for passengers. JFK trips make up almost 30% of the long-delay trips (60 out of 207).
- The percentage of delays during work hours is slightly higher, but not with practical significance.
- Furthermore, in the long-delay distribution the long distances are clearly favoured over shorter ones. Remember that here we are only comparing trips within the same bracket of total_travel_time (50 - 2000 s).
- In addition, not shown here is the observation that long delays appear to be slightly more frequent on Tuesdays or in January and March, compared to Wednesdays and Thursdays or February. However, considering the number of categories, and that the deviations reach at most 5 percentage points, these effects might be random.
Another curious group of outliers can be found in the lower middle portion of the trip_duration vs total_travel_time plot. Here we encode the colour of the trips by their vendor_id, to further highlight the impact of vendor 2 on the long-duration outliers on the top right and the few ones on the bottom right:
train %>%
ggplot(aes(trip_duration, total_travel_time, color = vendor_id)) +
geom_point(alpha = 0.3) +
geom_abline(intercept = 0, slope = 1, colour = "red") +
scale_x_log10() +
scale_y_log10() +
geom_hline(yintercept = c(10), linetype = 2) +
geom_vline(xintercept = c(600,5000), linetype = 2)
Fig. 28a
The second group of outliers lies at total_travel_time < 10 and trip_duration = 600 - 5000 (units in seconds, as usual; see the hidden figure or Fig. 26). Therefore, these are trips over very short distances that should have only taken a few seconds but, in our data, took between 10 minutes and 1.4 hours. In a way, there is a continuum of data points in the lower-left to middle part of the plot, but the ones over 10 minutes show another peak in density:
train %>%
filter(total_travel_time<10, trip_duration<10000) %>%
ggplot(aes(trip_duration))+
geom_density(fill="red")+
geom_vline(xintercept=c(600,5000),linetype=2)+
scale_x_log10()
Fig. 28
foo <- train %>%
filter(total_travel_time<10) %>%
mutate(delay=trip_duration>600)
p1 <- foo %>%
ggplot(aes(vendor_id, group=delay))+
geom_bar(aes(y=..prop.., fill=factor(..x..)),stat="count")+
scale_y_continuous(labels=scales::percent)+
ylab("relative frequencies")+
facet_grid(~delay)+
theme(legend.position = "none")
p2 <- foo %>%
ggplot(aes(wday, group=delay))+
geom_bar(aes(y=..prop.., fill=factor(..x..)),stat="count")+
scale_y_continuous(labels=scales::percent)+
ylab("relative frequencies")+
facet_grid(~delay)+
theme(legend.position = "none")
p3 <- foo %>%
ggplot(aes(passenger_count, group=delay))+
geom_bar(aes(y=..prop..,fill=factor(..x..)),stat="count")+
scale_y_continuous(labels=scales::percent)+
ylab("relative frequencies")+
facet_grid(~delay)+
theme(legend.position = "none")
p4 <- foo %>%
ggplot(aes(total_distance,fill=delay))+
geom_density(alpha=0.5)+
coord_cartesian(xlim=c(0,100))
layout <- matrix(c(1,1,2,2,2,2,3,3,3,4,4,4),2,6,byrow=TRUE)
multiplot(p1,p2,p3,p4, layout=layout)
Fig. 29
p1 <- 1; p2 <- 1; p3 <- 1; p4 <- 1; p5 <- 1; p6 <- 1
We find:

- Again, vendor 2 has systematically longer trip_durations than vendor 1.
- Delays of this second kind appear to be more frequent on weekends, compared to the overall distribution.
- The number of passengers is higher, too, for these delays; maybe we are counting the time it takes them (and their luggage) to get in and out of the taxi?
- The distances for the delays are again longer, but notice the different scale here. Distances of less than 100 m should not take more than 10 min.
- Not shown: jfk_trip and lg_trip have no impact here because the overall distances are so short.
In summary, we find that the vendor_id has distinguishing power for the two outlier groups, with airport trips very likely to contribute to the long-delay outliers. Other factors like wday or passenger_count might contribute additional signal.
After engineering new features and before starting the modelling, we will visualise the relations between our parameters using a correlation matrix. For this, we need to change all the input features into a numerical format. The visualisation uses the corrplot function from the eponymous package. Corrplot gives us great flexibility in manipulating the style of our plot.
What we see below are the colour-coded correlation coefficients for each combination of two features. In simplest terms, this shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero, the weaker the correlation. Both 1 and -1 are the ideal cases of perfect correlation and anti-correlation (dark blue and dark red in the plots below).
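As a minimal toy illustration of these coefficient values (made-up vectors, not competition data):

```r
x <- 1:100
cor(x, 2 * x + 5)    # exactly  1: perfect correlation
cor(x, -x)           # exactly -1: perfect anti-correlation
set.seed(4321)
cor(x, rnorm(100))   # close to 0: essentially uncorrelated
```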
Here, we are of course interested if and how strongly our features correlate with the trip_duration, the prediction of which is the ultimate goal of this challenge. But we also want to know whether our potential predictors are correlated among each other, so that we can reduce the collinearity in our data set and improve the robustness of our prediction:
train %>%
select(-id,-pickup_datetime,-dropoff_datetime,-jfk_dist_pick,
-jfk_dist_drop, -lg_dist_pick, -lg_dist_drop, -date) %>%
mutate(
passenger_count=as.integer(passenger_count),
vendor_id=as.integer(vendor_id),
store_and_fwd_flag=as.integer(as.factor(store_and_fwd_flag)),
jfk_trip = as.integer(jfk_trip),
wday = as.integer(wday),
month = as.integer(month),
work = as.integer(work),
lg_trip = as.integer(lg_trip),
blizzard = as.integer(blizzard),
has_snow = as.integer(has_snow),
has_rain = as.integer(has_rain)) %>%
select(trip_duration,speed,everything()) %>%
cor(use="complete.obs",method="spearman") %>%
corrplot(type="lower",method="circle",diag=FALSE)
Fig. 30a
If we remove the features with little to no significant impact on other features that are not directly related then our (effective) correlation matrix looks like this. In this plot (and in the hidden figure) the fading serves the purpose that everything that would be hard to see is not worth seeing:
foo <- train %>%
select(-id,-pickup_datetime,-dropoff_datetime,-jfk_dist_pick,
-jfk_dist_drop, -lg_dist_pick, -lg_dist_drop, -date,
-store_and_fwd_flag, -hour, -rain, -s_fall, -all_precip,
-has_rain, -s_depth, -min_temp, -max_temp,
-wday, -right_turns, -turns, -fast_speed_trip) %>%
mutate(passenger_count = as.integer(passenger_count),
vendor_id = as.integer(vendor_id),
jfk_trip = as.integer(jfk_trip),
lg_trip = as.integer(lg_trip),
month = as.integer(month),
work = as.integer(work),
blizzard = as.integer(blizzard),
has_snow = as.integer(has_snow)) %>%
select(trip_duration, speed, everything())
foo %>%
cor(use="complete.obs", method = "spearman") %>%
corrplot(type="lower", method="circle", diag=FALSE)
## Warning in cor(., use = "complete.obs", method = "spearman"): the standard
## deviation is zero
foo <- 1
We find:

- The strongest correlations with the trip_duration are seen for the direct distance as well as the total_distance and total_travel_time derived from the OSRM data. As we have already seen above, the distances are highly correlated with one another and with the total_travel_time. Also the number of turns, presumably mostly via the number_of_steps, has an impact on the trip_duration.
- Another effect on the trip_duration can be seen for our engineered features jfk_trip and lg_trip, indicating journeys to either airport. A similar statement is true for the average speed and airport travel.
- The pickup and dropoff coordinates are correlated, which is a bit puzzling but might be partly explained by the shape of Manhattan stretching from south-west to north-east. Another part of the explanation might be short trips within lower or upper Manhattan only.
- vendor_id is correlated with passenger_count because vendor 2 has all the (five) trips with more than 6 passengers.
- Among the weather data (mostly seen in the full hidden plot), the minimum and maximum temperatures show a strong correlation with each other and with the month of the year, which is intuitively understandable. Also the different ways of measuring precipitation are naturally correlated with each other and, in the case of snow, with the month.
We should be able to see these relations in our model.
Our challenge is by nature a regression problem: we want to predict a continuous variable, trip_duration, from a set of features. In this kernel we will continue to treat it as a regression problem, but in this section I want to present a brief excursion into the possibility of turning it into a classification problem.
In a classification context, our target variable can only take a (small) number of discrete values. Often, we are faced with a binary outcome, for instance “Survived vs Not survived” in the popular Titanic Challenge. Here, we want to examine which factors contribute to shorter or longer trip_durations.
Recall the distribution of trip_durations from the beginning of this notebook:
train_group <- train %>%
mutate(tgroup=case_when(trip_duration<3e2 ~ "fast",
trip_duration>=3e2 & trip_duration<=1.6e3 ~ "mid",
trip_duration>1.6e3 ~ "slow"))
train_group %>%
ggplot(aes(trip_duration,fill=tgroup))+
geom_histogram(bins=300)+
scale_x_log10()+
scale_y_sqrt()
Fig. 31
Ignore the apparent spikes caused by the binning resolution.
For the purpose of this exercise, we are not interested in the middle bit. Our aim is to compare the properties of the fast vs slow trips.
train_group <- train_group %>%
filter(tgroup!="mid")
p1 <- train_group %>%
ggplot(aes(wday,fill=tgroup))+
geom_bar(position="fill")+
theme(legend.position="none")
p2 <- train_group %>%
ggplot(aes(month,fill=tgroup))+
geom_bar(position="fill")+
theme(legend.position = "none")
p3 <- train_group %>%
ggplot(aes(hour, fill = tgroup)) +
geom_bar(position = "fill")
p4 <- train_group %>%
ggplot(aes(jfk_trip, fill = tgroup)) +
geom_bar(position = "fill") +
theme(legend.position = "none")
p5 <- train_group %>%
ggplot(aes(lg_trip, fill = tgroup)) +
geom_bar(position = "fill") +
theme(legend.position = "none")
p6 <- train_group %>%
ggplot(aes(blizzard, fill = tgroup)) +
geom_bar(position = "fill") +
theme(legend.position = "none")
p7 <- train_group %>%
ggplot(aes(work, fill = tgroup)) +
geom_bar(position = "fill") +
theme(legend.position = "none")
layout <- matrix(c(1,1,2,2,3,3,3,3,4,5,6,7),3,4,byrow = TRUE)
multiplot(p1,p2,p3,p4,p5,p6,p7,layout=layout)
Fig. 32
We find:

- Our airport features, jfk_trip and lg_trip, behave as expected and are predominantly slow trips. Also the blizzard feature shows a promising increase in slow trips.
- Throughout the week, there is a notable difference between the weekend and the weekdays. It might be worth encoding the wday feature according to this trend. We see that the monthly trend already shows a gradually increasing “slow” proportion.
- During the day, we find again the difference between night and (working) day that we saw previously in the scatter plot plus heatmap, and which is partly resolved in our engineered work feature.
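A minimal sketch of the suggested wday encoding (assuming the wday factor uses the usual English day abbreviations; adjust the level names to your locale if necessary):

```r
# Binary weekend flag derived from the wday factor
train_group <- train_group %>%
  mutate(weekend = wday %in% c("Sat", "Sun"))
```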
For a visualisation of the pickup coordinates for the fast vs slow rides, we build another map using the great leaflet package. Those maps don’t require external API calls and run fully contained in a Kaggle kernel. We visualise our two groups of trips using a heatmap, which we add through tools in the leaflet.extras package.
In this leaflet visualisation, you can toggle and overlay the fast vs slow rides independently, and also choose a dark vs light background map (in addition to a free zoom and pan). These layers are easy to define within a leaflet object; have a look at the code to see how concise the addLayersControl element is. Zoom in to see street-level resolution:
pick_fast <- train_group %>%
filter(tgroup=="fast") %>%
select(long=pickup_longitude,lat=pickup_latitude)
pick_slow <- train_group %>%
filter(tgroup=="slow") %>%
select(long=pickup_longitude,lat=pickup_latitude)
leaflet(pick_fast) %>%
# addProviderTiles(providers$CartoDB.DarkMatter, group = "Dark map") %>%
# addProviderTiles(providers$CartoDB.Positron, group = "Light map") %>%
addTiles(group="Dark map") %>%
addTiles(group="Light map") %>%
addScaleBar() %>%
setView(-73.95, 40.74, zoom = 11) %>%
addHeatmap(max = 140, gradient = "Reds", radius = 12,
minOpacity = 0.15, blur = 15, group = "Fast trips") %>%
addHeatmap(lng = pick_slow$long, lat = pick_slow$lat,
max = 100, gradient = "Blues", radius = 12,
minOpacity = 0.25, blur = 15, group = "Slow trips") %>%
addLayersControl(
baseGroups = c("Dark map", "Light map"),
overlayGroups = c("Fast trips", "Slow trips"),
options = layersControlOptions(collapsed = FALSE)
)